[pull] main from triton-lang:main by pull[bot] · Pull Request #1 · maryamtahhan/triton-cpu

pull · 2025-04-30T11:17:48Z

See Commits and Changes for more details.

Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

TMEMCopy pattern is using stale tmem encoding. This may cause a crash in the validator if the encodings mismatch. --------- Co-authored-by: root <root@codex-gb200-0.brix.pawelszczerbuk.svc.cluster.local> Co-authored-by: Codex <noreply@openai.com>

Co-authored-by: Adam P. Goucher <goucher@statslab.cam.ac.uk>

@vprintf

…9975) fuseFuncArgVariables calls funcOp.getBody().front() on every LLVMFuncOp, but external declarations (e.g @vprintf) have no body which causes the pass to crash. Reproducible on nvidia with LLVM_EXTRACT_DI_LOCAL_VARIABLES=1 on any kernel that uses device_print. Changes: - Skip fuseFuncArgVariables for external declarations in LLVMDILocalVariable - Relax the isIntOrFloat() assert in both LLVMDIScope and LLVMDILocalVariable to also accept VectorType, consistent with convertType() which already handles it - Add lit test

It doesn't make sense currently to have CLC in a multicta context without multicast. As such, we hide this flag and we infer it automatically. The reason why it doesn't make sense it's because in multicta just one CTA is allowed to request to cancel. In CUDA you can imagine patterns like doing CLC without multicast and then share the result from one CTA to all the others manually. We don't allow that in Gluon.

We also add tighter invariants for gather/scatter ops as well

Fix's API issues in the bench_mlp.py script. `python/triton_kernels/bench/bench_mlp.py` no longer ran with current Triton code. Running: ``` torchrun --nproc-per-node=1 python/triton_kernels/bench/bench_mlp.py ``` fails with ``` [rank0]: Traceback (most recent call last): [rank0]: File "/workspace/triton-source/python/triton_kernels/bench/bench_mlp.py", line 230, in <module> [rank0]: roofline_mlp(batch_sizes, 5760, 5760, 128, 4, dtypes[0], dtypes[1], ep, name="mlp_moe") [rank0]: File "/workspace/triton-source/python/triton_kernels/bench/bench_mlp.py", line 194, in roofline_mlp [rank0]: csv_path = roofline.compute_roofline(dim1, dim2, n_expts_tot, n_expts_act, parse_dtype(x_dtype), [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.12/dist-packages/triton_kernels/roofline.py", line 73, in compute_roofline [rank0]: perf = inject_proxy_and_call(val, args, kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/usr/local/lib/python3.12/dist-packages/triton_kernels/roofline.py", line 64, in inject_proxy_and_call [rank0]: return bench_fn(*args_list, **kwargs) [rank0]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [rank0]: File "/workspace/triton-source/python/triton_kernels/bench/bench_mlp.py", line 100, in bench_mlp [rank0]: symm_mem_pool = SymmetricMemoryPool() [rank0]: ^^^^^^^^^^^^^^^^^^^^^ [rank0]: TypeError: SymmetricMemoryPool.__init__() missing 1 required positional argument: 'mesh' E0403 20:14:38.021000 2225 torch/distributed/elastic/multiprocessing/api.py:988] failed (exitcode: 1) local_rank: 0 (pid: 2258) of binary: /usr/bin/python3 Traceback (most recent call last): File "/usr/local/bin/torchrun", line 6, in <module> sys.exit(main()) ^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 367, in wrapper return f(*args, **kwargs) ^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 1016, in main run(args) File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 1007, in run elastic_launch( File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 184, in __call__ return launch_agent(self._config, self._entrypoint, list(args)) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 332, in launch_agent raise ChildFailedError( torch.distributed.elastic.multiprocessing.errors.ChildFailedError: ============================================================ python/triton_kernels/bench/bench_mlp.py FAILED ------------------------------------------------------------ Failures: <NO_OTHER_FAILURES> ------------------------------------------------------------ Root Cause (first observed failure): [0]: time : 2026-04-03_20:14:38 host : ab3ee0d0c408 rank : 0 (local_rank: 0) exitcode : 1 (pid: 2258) error_file: <N/A> traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html ```

We follow TMA load / store closely.

The main beneficial optimizations are: 1. Separate loader for weight and weight scales, allowing asymmetric pipelining. Potentially we can have a separate partition for scale factor load as well, but I didn't experiment with it. 2. Epilogue optimizations to decrease the number of instructions. Epilogue instruction issue (especially with 8 warps) staves the other warps of instruction issuing. Optimizing it improves overall performance. I repurposed the 2 idle leftover warps as a store partition to decrease the critical path in the epilogue. gpt-oss-120b shapes, performance vs. `triton_kernels.matmul` on synthetic "realistic" logits. ``` GPT-OSS-120B MoE MM1 E=128 EP=4 ES=8 B=2880x5760 Peak: 5 PFLOPS, 8 TBPS batch_size example reference ---------------------------------------------------------------------------------------------------------------- 128 29.12 TFLOPS ( 0.6%) 5.27 TBPS ( 65.9%) 22.74 TFLOPS ( 0.5%) 4.11 TBPS ( 51.4%) 160 36.67 TFLOPS ( 0.7%) 5.96 TBPS ( 74.5%) 27.66 TFLOPS ( 0.6%) 4.49 TBPS ( 56.2%) 192 39.57 TFLOPS ( 0.8%) 6.00 TBPS ( 75.0%) 28.94 TFLOPS ( 0.6%) 4.39 TBPS ( 54.9%) 224 52.64 TFLOPS ( 1.1%) 5.99 TBPS ( 74.9%) 39.21 TFLOPS ( 0.8%) 4.46 TBPS ( 55.8%) 256 55.48 TFLOPS ( 1.1%) 6.46 TBPS ( 80.7%) 40.17 TFLOPS ( 0.8%) 4.67 TBPS ( 58.4%) 320 67.89 TFLOPS ( 1.4%) 6.64 TBPS ( 83.0%) 41.19 TFLOPS ( 0.8%) 4.03 TBPS ( 50.4%) 384 73.96 TFLOPS ( 1.5%) 6.43 TBPS ( 80.4%) 45.74 TFLOPS ( 0.9%) 3.98 TBPS ( 49.7%) 448 91.18 TFLOPS ( 1.8%) 6.64 TBPS ( 83.0%) 53.82 TFLOPS ( 1.1%) 3.92 TBPS ( 49.0%) 512 78.16 TFLOPS ( 1.6%) 5.36 TBPS ( 67.0%) 60.78 TFLOPS ( 1.2%) 4.17 TBPS ( 52.1%) 640 90.66 TFLOPS ( 1.8%) 5.17 TBPS ( 64.6%) 58.41 TFLOPS ( 1.2%) 3.33 TBPS ( 41.6%) 768 93.65 TFLOPS ( 1.9%) 4.98 TBPS ( 62.3%) 61.17 TFLOPS ( 1.2%) 3.25 TBPS ( 40.7%) 896 103.66 TFLOPS ( 2.1%) 5.04 TBPS ( 63.1%) 66.78 TFLOPS ( 1.3%) 3.25 TBPS ( 40.6%) 1024 135.60 TFLOPS ( 2.7%) 5.16 TBPS ( 64.5%) 86.31 TFLOPS ( 1.7%) 3.28 TBPS ( 41.0%) 1280 131.21 TFLOPS ( 2.6%) 3.80 TBPS ( 47.5%) 119.41 TFLOPS ( 2.4%) 3.46 TBPS ( 43.3%) 1536 175.65 TFLOPS ( 3.5%) 4.26 TBPS ( 53.3%) 145.39 TFLOPS ( 2.9%) 3.53 TBPS ( 44.1%) 1792 181.42 TFLOPS ( 3.6%) 3.85 TBPS ( 48.1%) 148.73 TFLOPS ( 3.0%) 3.15 TBPS ( 39.4%) 2048 188.32 TFLOPS ( 3.8%) 3.96 TBPS ( 49.5%) 165.47 TFLOPS ( 3.3%) 3.48 TBPS ( 43.5%) 2560 205.57 TFLOPS ( 4.1%) 3.45 TBPS ( 43.2%) 177.60 TFLOPS ( 3.6%) 2.98 TBPS ( 37.3%) 3072 263.31 TFLOPS ( 5.3%) 3.62 TBPS ( 45.2%) 253.05 TFLOPS ( 5.1%) 3.48 TBPS ( 43.5%) 3584 283.21 TFLOPS ( 5.7%) 3.38 TBPS ( 42.2%) 290.26 TFLOPS ( 5.8%) 3.46 TBPS ( 43.2%) 4096 348.60 TFLOPS ( 7.0%) 3.61 TBPS ( 45.2%) 329.95 TFLOPS ( 6.6%) 3.42 TBPS ( 42.8%) 5120 421.37 TFLOPS ( 8.4%) 3.57 TBPS ( 44.6%) 385.63 TFLOPS ( 7.7%) 3.27 TBPS ( 40.8%) 6144 524.65 TFLOPS ( 10.5%) 3.69 TBPS ( 46.1%) 473.72 TFLOPS ( 9.5%) 3.33 TBPS ( 41.6%) 7168 626.20 TFLOPS ( 12.5%) 3.79 TBPS ( 47.4%) 582.67 TFLOPS ( 11.7%) 3.53 TBPS ( 44.1%) 8192 675.19 TFLOPS ( 13.5%) 3.64 TBPS ( 45.6%) 629.79 TFLOPS ( 12.6%) 3.40 TBPS ( 42.5%) 9216 729.35 TFLOPS ( 14.6%) 3.62 TBPS ( 45.2%) 674.98 TFLOPS ( 13.5%) 3.35 TBPS ( 41.9%) 10240 749.61 TFLOPS ( 15.0%) 3.40 TBPS ( 42.4%) 685.04 TFLOPS ( 13.7%) 3.10 TBPS ( 38.8%) 11264 806.46 TFLOPS ( 16.1%) 3.34 TBPS ( 41.8%) 721.29 TFLOPS ( 14.4%) 2.99 TBPS ( 37.3%) 12288 895.78 TFLOPS ( 17.9%) 3.38 TBPS ( 42.3%) 805.38 TFLOPS ( 16.1%) 3.04 TBPS ( 38.0%) 13312 1008.57 TFLOPS ( 20.2%) 3.53 TBPS ( 44.1%) 921.30 TFLOPS ( 18.4%) 3.22 TBPS ( 40.3%) 14336 1001.71 TFLOPS ( 20.0%) 3.24 TBPS ( 40.5%) 915.14 TFLOPS ( 18.3%) 2.96 TBPS ( 37.0%) 15360 1108.54 TFLOPS ( 22.2%) 3.35 TBPS ( 41.8%) 1020.08 TFLOPS ( 20.4%) 3.08 TBPS ( 38.5%) 16384 1182.28 TFLOPS ( 23.6%) 3.41 TBPS ( 42.6%) 1083.60 TFLOPS ( 21.7%) 3.12 TBPS ( 39.0%) 17408 1255.35 TFLOPS ( 25.1%) 3.44 TBPS ( 43.0%) 1106.65 TFLOPS ( 22.1%) 3.03 TBPS ( 37.9%) 18432 1339.98 TFLOPS ( 26.8%) 3.50 TBPS ( 43.7%) 1188.63 TFLOPS ( 23.8%) 3.10 TBPS ( 38.8%) 19456 1411.14 TFLOPS ( 28.2%) 3.49 TBPS ( 43.6%) 1239.33 TFLOPS ( 24.8%) 3.06 TBPS ( 38.3%) 20480 1465.49 TFLOPS ( 29.3%) 3.46 TBPS ( 43.3%) 1329.22 TFLOPS ( 26.6%) 3.14 TBPS ( 39.2%) 21504 1429.77 TFLOPS ( 28.6%) 3.23 TBPS ( 40.3%) 1228.47 TFLOPS ( 24.6%) 2.77 TBPS ( 34.7%) 22528 1432.90 TFLOPS ( 28.7%) 3.08 TBPS ( 38.5%) 1283.04 TFLOPS ( 25.7%) 2.76 TBPS ( 34.5%) 23552 1530.40 TFLOPS ( 30.6%) 3.17 TBPS ( 39.6%) 1304.38 TFLOPS ( 26.1%) 2.70 TBPS ( 33.8%) 24576 1531.43 TFLOPS ( 30.6%) 3.05 TBPS ( 38.1%) 1378.61 TFLOPS ( 27.6%) 2.74 TBPS ( 34.3%) 25600 1609.22 TFLOPS ( 32.2%) 3.09 TBPS ( 38.7%) 1403.54 TFLOPS ( 28.1%) 2.70 TBPS ( 33.7%) 26624 1591.86 TFLOPS ( 31.8%) 2.96 TBPS ( 37.0%) 1461.78 TFLOPS ( 29.2%) 2.72 TBPS ( 34.0%) 27648 1708.82 TFLOPS ( 34.2%) 3.08 TBPS ( 38.5%) 1554.22 TFLOPS ( 31.1%) 2.80 TBPS ( 35.0%) 28672 1715.65 TFLOPS ( 34.3%) 3.00 TBPS ( 37.5%) 1582.07 TFLOPS ( 31.6%) 2.77 TBPS ( 34.6%) 29696 1756.95 TFLOPS ( 35.1%) 2.98 TBPS ( 37.3%) 1562.01 TFLOPS ( 31.2%) 2.65 TBPS ( 33.1%) 30720 1801.20 TFLOPS ( 36.0%) 2.97 TBPS ( 37.2%) 1681.00 TFLOPS ( 33.6%) 2.77 TBPS ( 34.7%) 31744 1979.10 TFLOPS ( 39.6%) 3.17 TBPS ( 39.7%) 1754.99 TFLOPS ( 35.1%) 2.82 TBPS ( 35.2%) ``` And on uniform logits ``` batch_size example reference ---------------------------------------------------------------------------------------------------------------- 128 73.67 TFLOPS ( 1.5%) 5.41 TBPS ( 67.7%) 60.70 TFLOPS ( 1.2%) 4.46 TBPS ( 55.7%) 160 96.46 TFLOPS ( 1.9%) 5.50 TBPS ( 68.7%) 78.10 TFLOPS ( 1.6%) 4.45 TBPS ( 55.6%) 192 115.43 TFLOPS ( 2.3%) 5.43 TBPS ( 67.9%) 94.71 TFLOPS ( 1.9%) 4.45 TBPS ( 55.7%) 224 141.31 TFLOPS ( 2.8%) 5.47 TBPS ( 68.3%) 115.26 TFLOPS ( 2.3%) 4.46 TBPS ( 55.7%) 256 155.77 TFLOPS ( 3.1%) 5.47 TBPS ( 68.4%) 126.31 TFLOPS ( 2.5%) 4.44 TBPS ( 55.5%) 320 205.17 TFLOPS ( 4.1%) 5.46 TBPS ( 68.2%) 161.57 TFLOPS ( 3.2%) 4.30 TBPS ( 53.7%) 384 251.10 TFLOPS ( 5.0%) 5.50 TBPS ( 68.8%) 195.10 TFLOPS ( 3.9%) 4.28 TBPS ( 53.5%) 448 295.50 TFLOPS ( 5.9%) 5.53 TBPS ( 69.2%) 228.63 TFLOPS ( 4.6%) 4.28 TBPS ( 53.5%) 512 316.07 TFLOPS ( 6.3%) 5.34 TBPS ( 66.7%) 254.07 TFLOPS ( 5.1%) 4.29 TBPS ( 53.6%) 640 414.90 TFLOPS ( 8.3%) 5.43 TBPS ( 67.9%) 293.48 TFLOPS ( 5.9%) 3.84 TBPS ( 48.0%) 768 489.21 TFLOPS ( 9.8%) 5.43 TBPS ( 67.9%) 350.19 TFLOPS ( 7.0%) 3.89 TBPS ( 48.6%) 896 553.62 TFLOPS ( 11.1%) 5.39 TBPS ( 67.4%) 405.50 TFLOPS ( 8.1%) 3.95 TBPS ( 49.4%) 1024 576.71 TFLOPS ( 11.5%) 4.92 TBPS ( 61.5%) 463.70 TFLOPS ( 9.3%) 3.95 TBPS ( 49.4%) 1280 682.95 TFLOPS ( 13.7%) 4.76 TBPS ( 59.5%) 571.87 TFLOPS ( 11.4%) 3.98 TBPS ( 49.8%) 1536 837.25 TFLOPS ( 16.7%) 4.99 TBPS ( 62.4%) 679.20 TFLOPS ( 13.6%) 4.05 TBPS ( 50.6%) 1792 934.09 TFLOPS ( 18.7%) 4.84 TBPS ( 60.5%) 686.46 TFLOPS ( 13.7%) 3.56 TBPS ( 44.5%) 2048 937.04 TFLOPS ( 18.7%) 4.25 TBPS ( 53.2%) 714.85 TFLOPS ( 14.3%) 3.24 TBPS ( 40.6%) 2560 1081.82 TFLOPS ( 21.6%) 4.07 TBPS ( 50.9%) 1020.85 TFLOPS ( 20.4%) 3.84 TBPS ( 48.0%) 3072 1313.97 TFLOPS ( 26.3%) 4.21 TBPS ( 52.7%) 1209.26 TFLOPS ( 24.2%) 3.88 TBPS ( 48.5%) 3584 1533.25 TFLOPS ( 30.7%) 4.27 TBPS ( 53.3%) 1410.67 TFLOPS ( 28.2%) 3.93 TBPS ( 49.1%) 4096 1399.52 TFLOPS ( 28.0%) 3.42 TBPS ( 42.7%) 1245.96 TFLOPS ( 24.9%) 3.04 TBPS ( 38.1%) 5120 1489.01 TFLOPS ( 29.8%) 2.97 TBPS ( 37.1%) 1320.69 TFLOPS ( 26.4%) 2.63 TBPS ( 32.9%) 6144 1901.82 TFLOPS ( 38.0%) 3.21 TBPS ( 40.2%) 1638.02 TFLOPS ( 32.8%) 2.77 TBPS ( 34.6%) 7168 2234.87 TFLOPS ( 44.7%) 3.28 TBPS ( 41.0%) 1987.25 TFLOPS ( 39.7%) 2.92 TBPS ( 36.5%) 8192 1999.71 TFLOPS ( 40.0%) 2.59 TBPS ( 32.4%) 1747.04 TFLOPS ( 34.9%) 2.26 TBPS ( 28.3%) 9216 2024.82 TFLOPS ( 40.5%) 2.38 TBPS ( 29.7%) 1802.45 TFLOPS ( 36.0%) 2.12 TBPS ( 26.5%) 10240 2233.54 TFLOPS ( 44.7%) 2.38 TBPS ( 29.8%) 1987.52 TFLOPS ( 39.8%) 2.12 TBPS ( 26.5%) 11264 2423.11 TFLOPS ( 48.5%) 2.40 TBPS ( 30.0%) 2163.00 TFLOPS ( 43.3%) 2.14 TBPS ( 26.8%) 12288 2230.25 TFLOPS ( 44.6%) 2.05 TBPS ( 25.7%) 1971.60 TFLOPS ( 39.4%) 1.82 TBPS ( 22.7%) 13312 2386.76 TFLOPS ( 47.7%) 2.07 TBPS ( 25.9%) 2089.59 TFLOPS ( 41.8%) 1.81 TBPS ( 22.7%) 14336 2535.85 TFLOPS ( 50.7%) 2.08 TBPS ( 25.9%) 2214.30 TFLOPS ( 44.3%) 1.81 TBPS ( 22.7%) 15360 2708.63 TFLOPS ( 54.2%) 2.10 TBPS ( 26.3%) 2371.09 TFLOPS ( 47.4%) 1.84 TBPS ( 23.0%) 16384 2495.78 TFLOPS ( 49.9%) 1.85 TBPS ( 23.1%) 2256.94 TFLOPS ( 45.1%) 1.67 TBPS ( 20.9%) 17408 2613.94 TFLOPS ( 52.3%) 1.85 TBPS ( 23.1%) 2350.97 TFLOPS ( 47.0%) 1.66 TBPS ( 20.8%) 18432 2600.95 TFLOPS ( 52.0%) 1.76 TBPS ( 22.0%) 2376.24 TFLOPS ( 47.5%) 1.61 TBPS ( 20.1%) 19456 2720.43 TFLOPS ( 54.4%) 1.77 TBPS ( 22.1%) 2445.94 TFLOPS ( 48.9%) 1.59 TBPS ( 19.9%) 20480 2694.19 TFLOPS ( 53.9%) 1.68 TBPS ( 21.0%) 2418.54 TFLOPS ( 48.4%) 1.51 TBPS ( 18.9%) 21504 2667.33 TFLOPS ( 53.3%) 1.61 TBPS ( 20.1%) 2432.76 TFLOPS ( 48.7%) 1.47 TBPS ( 18.4%) 22528 2773.18 TFLOPS ( 55.5%) 1.62 TBPS ( 20.3%) 2531.76 TFLOPS ( 50.6%) 1.48 TBPS ( 18.5%) 23552 2867.55 TFLOPS ( 57.4%) 1.62 TBPS ( 20.3%) 2604.20 TFLOPS ( 52.1%) 1.48 TBPS ( 18.4%) 24576 2736.19 TFLOPS ( 54.7%) 1.50 TBPS ( 18.8%) 2499.52 TFLOPS ( 50.0%) 1.37 TBPS ( 17.2%) 25600 2789.45 TFLOPS ( 55.8%) 1.49 TBPS ( 18.6%) 2554.81 TFLOPS ( 51.1%) 1.37 TBPS ( 17.1%) 26624 2882.35 TFLOPS ( 57.6%) 1.50 TBPS ( 18.7%) 2669.93 TFLOPS ( 53.4%) 1.39 TBPS ( 17.4%) 27648 2856.47 TFLOPS ( 57.1%) 1.45 TBPS ( 18.1%) 2624.29 TFLOPS ( 52.5%) 1.33 TBPS ( 16.7%) 28672 2851.52 TFLOPS ( 57.0%) 1.41 TBPS ( 17.7%) 2621.20 TFLOPS ( 52.4%) 1.30 TBPS ( 16.2%) 29696 2826.90 TFLOPS ( 56.5%) 1.37 TBPS ( 17.1%) 2483.67 TFLOPS ( 49.7%) 1.20 TBPS ( 15.1%) 30720 2894.51 TFLOPS ( 57.9%) 1.37 TBPS ( 17.2%) 2527.41 TFLOPS ( 50.5%) 1.20 TBPS ( 15.0%) 31744 2875.63 TFLOPS ( 57.5%) 1.34 TBPS ( 16.7%) 2505.68 TFLOPS ( 50.1%) 1.16 TBPS ( 14.5%) ```

[GSan] Model TMAReduceOp as atomic Previously we were modelling it as a non-atomic store. This now models it as atomic, but is still not 100% right as we're doing the shadow cell update eagerly before the TMA operation has completed.

…ale with nvfp4 output (#9854) nvfp4 has a tensor-wide scale. This PR adds support for this scale when the matmul output needs to be quantized to nvfp4. We use flex ctx to carry the scale.  # New contributor declaration - [ ] I am not making a trivial change, such as fixing a typo in a comment. - [ ] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [ ] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

* Convert `mask` to a tensor so it broadcasts correctly in gluon. NFC for triton * fix the `tl_obj_*` helpers * add an extra hook for rewriting some argument annotations

We make things a bit more uniform by decomposing the xor mask and emitting instructions accordingly. For a `mask` in [1, 15], on RDNA + gfx1250 we use a single `row_xmask` DPP instruction, and on CDNA we use 1 or 2 DPP instructions. For a `mask >= 16`, on RDNA, we use a single `v_permlanex16`, and on CDNA we use `ds_bpermute`. We also pull some static utility functions into an anonymous namespace and remove the ShflKind::down case from the enum as it was unimplemented.

Improve f16 gemm gfx1250-gluon performance. Improves gemm_tdm_pipelined_single_warp_per_simd_schedule_kernel by moving tdm.load earlier; from the top of the loop (which hides 3/4th of a loop-iteration's worth of cycles) to right after the wait (which hides a full loop-iteration's worth of cycles). This only fixes the mentioned kernel; other kernels need independent benchmarking and improving.

CLC gets its own partition, running over threads 48-63. We model CLC as we model TMA writes, via a Barrier::EffectWrites. The idea of this mode is that we link all the writes on the op to the barrier. We also annotate in the table `barrierWriteRecipients` which CTAs will become visible once we wait on the associated barrier. We note something interesting and document it. `BarrierTrackingMode::Frontier` should be used when we have a commit/arrive/expect op that affects anything in flight before it. Instead, we use `BarrierTrackingMode::EffectWrites` when the PTX op accepts a barrier so the barrier just signals the completion of the op's particular write. The other point we add is a flag `bool diagonalEffectRecipientCTAs`. This differentiates the behaviour between TMA, where after waiting on the barrier you see all the writes from all the CTAs in the multicas group, vs. the diagonal version, as in CLC, where waiting on CTAi just makes the thread see the CTAi memory.

@leijurv

Hello! The coalescing pass currently chooses `sizePerThread` for atomics using the same pointer alignment / contiguity heuristic as regular stores. For some atomic lowerings this is too optimistic: the pass will select a 128-bit per-thread layout even when the backend lowers the atomic as a narrower operation. That creates gaps in the warp-level write pattern and can hurt throughput significantly. On the attached gh200 atomic-add microbenchmark, this PR improves Triton from: - `int32`: 17.98 -> 72.86 Gupdates/s (~4.1x) - `int64`: 19.29 -> 39.21 Gupdates/s (~2.0x) - `fp64`: 13.42 -> 26.79 Gupdates/s (~2.0x) These numbers match the best Gluon blocked layouts, which use `size_per_thread = 1` for `int32` / `int64` / `fp64`. The attached microbenchmark focuses on atomic add, and we have only tested on Hopper, but we think it could be worth building a richer lowering path for different compute capabilities and dtypes, as more people realise the performance benefits of associative int32 atomics. Thanks @leijurv for input on these commits. <details> <summary><b>Raw results</b> (click to expand)</summary> ``` int32 results: baseline triton atomic took 382.13 ms @ 17.98 Gupdates/s patched triton atomic took 94.32 ms @ 72.86 Gupdates/s Gluon [8] atomic took 706.87 ms @ 9.72 Gupdates/s Gluon [4] atomic took 381.84 ms @ 18.00 Gupdates/s Gluon [2] atomic took 197.41 ms @ 34.81 Gupdates/s Gluon [1] atomic took 94.88 ms @ 72.43 Gupdates/s int64 results: baseline triton atomic took 356.30 ms @ 19.29 Gupdates/s patched triton atomic took 175.25 ms @ 39.21 Gupdates/s Gluon [8] atomic took 773.59 ms @ 8.88 Gupdates/s Gluon [4] atomic took 684.48 ms @ 10.04 Gupdates/s Gluon [2] atomic took 356.34 ms @ 19.28 Gupdates/s Gluon [1] atomic took 174.56 ms @ 39.37 Gupdates/s fp64 results: baseline triton atomic took 512.20 ms @ 13.42 Gupdates/s patched triton atomic took 256.55 ms @ 26.79 Gupdates/s Gluon [8] atomic took 773.32 ms @ 8.89 Gupdates/s Gluon [4] atomic took 683.84 ms @ 10.05 Gupdates/s Gluon [2] atomic took 512.01 ms @ 13.42 Gupdates/s Gluon [1] atomic took 253.05 ms @ 27.16 Gupdates/s ``` </details> <details> <summary><b>Microbenchmark code</b> (click to expand)</summary> ```py import torch import triton import triton.language as tl import triton.experimental.gluon as gluon import triton.experimental.gluon.language as gl @triton.jit def triton_atomic_add_kernel(input, output, ELEMENTS_PER_PID: tl.constexpr, INNER_REPEATS: tl.constexpr, SEM: tl.constexpr = "relaxed"): program_offset = tl.program_id(axis=0) * ELEMENTS_PER_PID load_store_indices = program_offset + tl.arange(0, ELEMENTS_PER_PID) load_indices = input + load_store_indices input = tl.load(load_indices) store_indices = output + load_store_indices for _ in tl.static_range(INNER_REPEATS): tl.atomic_add(store_indices, input, sem=SEM) @gluon.jit def gluon_atomic_add_kernel(input, output, ELEMENTS_PER_PID: gl.constexpr, INNER_REPEATS: gl.constexpr, SIZE_PER_THREAD: gl.constexpr, SEM: gl.constexpr = "relaxed"): LAYOUT: gl.constexpr = gl.BlockedLayout(size_per_thread=[SIZE_PER_THREAD], threads_per_warp=[32], warps_per_cta=[4], order=[0]) program_offset = gl.program_id(axis=0) * ELEMENTS_PER_PID load_store_indices = program_offset + gl.arange(0, ELEMENTS_PER_PID, layout=LAYOUT) load_indices = input + load_store_indices input = gl.load(load_indices) store_indices = output + load_store_indices for _ in gl.static_range(INNER_REPEATS): gl.atomic_add(store_indices, input, sem=SEM) def make_input(total_elements, dtype): if dtype.is_floating_point: return torch.rand((total_elements,), dtype=dtype) else: return torch.randint(-1000, 1000, (total_elements,), dtype=dtype) if __name__ == "__main__": torch.set_default_device('cuda') total_elements = 2**30 ELEMENTS_PER_PID = 2**9 grid = total_elements // ELEMENTS_PER_PID INNER_REPEATS = 2**6 for dtype in [torch.int32, torch.int64, torch.float32, torch.float64,]: print(f'\tDTYPE: {dtype}') input = make_input(total_elements, dtype) output_triton = torch.zeros_like(input) def triton_kernel_launcher(): triton_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, num_warps=4) ms_triton = triton.testing.do_bench(triton_kernel_launcher) triton_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_triton / 10e6 print(f'Triton atomic took \t{ms_triton:.2f} ms @ {triton_gupdates_per_sec:.2f} Gupdates/s') for SIZE_PER_THREAD in [8, 4, 2, 1]: def gluon_kernel_launcher(): gluon_atomic_add_kernel[(grid,)](input, output_triton, ELEMENTS_PER_PID, INNER_REPEATS, SIZE_PER_THREAD, num_warps=4) ms_gluon = triton.testing.do_bench(gluon_kernel_launcher) gluon_gupdates_per_sec = float(total_elements * INNER_REPEATS) / ms_gluon / 10e6 print(f'Gluon [{SIZE_PER_THREAD}] atomic took \t{ms_gluon:.2f} ms @ {gluon_gupdates_per_sec:.2f} Gupdates/s') ``` </details> # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [x] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

The test has been failing increasingly freqently. Disable while investigating.

…10066) The subslice safety check in lowerLdStMatrix uses a bitmask to verify that affine offsets don't touch the contiguous part of the tile's offset dimension. It was using getOutDimSizeLog2 (which returns log2 of the size) instead of getOutDimSize (the actual size) to construct this mask. For outDimSize=8: log2(8)-1 = 2 (0b010) only checks bit 1, whereas 8-1 = 7 (0b111) correctly checks all bits within the tile span. The bug makes the check too permissive — it could allow subslices that overlap the contiguous tile region. Latent because the specific bit patterns in maskSpanAffineOffset rarely trigger the difference. Fix both the NVIDIA (Utility.cpp) and AMD (MemoryOpToLLVM.cpp) backends. # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - [x] This PR does not need a test because existing tests pass, and the fix only makes the safety check stricter . - [x] I have not added any `lit` tests.

…ons. (#10060) The fp64 MMA path now operates at native `m8n8k4` granularity, supporting any shape that is a multiple of 8×8×4, including the minimal 8×8×4 case. This is an extension of #7310 (The implementation was based on that PR) Tests passed on A100. ## Files Changed ### `lib/Dialect/TritonGPU/IR/Dialect.cpp` - `getRepForOperand`: Changed `tileBitWidthK` from `2 * 256` to `1 * 256` for fp64 (K-tile = 4). Changed `tileSize[M]` from hardcoded `16` to `8` for fp64. ### `lib/Dialect/TritonGPU/Transforms/Utility.cpp` - `mmaVersionToInstrShape`: Returns `instrShape[M] = 8` for fp64 (was always 16). ### `lib/Dialect/TritonGPU/IR/LinearLayoutConversions.cpp` - `nvidiaDotToLinearLayout`: Uses `instrShape` from the MMA encoding for tile shape computation. K tile multiplier is 4 (not 8) when `instrM == 8`. ### `third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/DotOpToLLVM/MMAv2.cpp` - `getMmaRetType`: fp64 returns `struct{f64, f64}` (2 elements) instead of `struct{f64, f64, f64, f64}` (4 elements). - `callMmaAmpereFp64`: Extended: Now it is able to also emit a single `m8n8k4` instruction per call (single retArgs(2), aArgs(1), bArgs(1), cArgs(2)). - `numRegisters`: `{1, 1, 1}` for fp64 (was effectively `{2, 1, 2}`). - `numMmaRets`: 2 for fp64 (was 4). - `numCPackedElem`: 1 for fp64 (was incorrectly computed). - fc indexing formula: Uses `numMmaRets * numCPackedElem` instead of hardcoded `4`. ### `third_party/nvidia/backend/compiler.py` - `min_dot_size`: Added `elif lhs_bitwidth == 64: return (1, 1, 4)` to allow K=4 for fp64. ### `python/test/unit/language/test_core.py` - Added small fp64 test cases: `(8,8,4)`, `(8,8,8)`, `(16,8,4)`, `(8,8,16)` with `num_warps=1`. ### `test/Conversion/tritongpu_to_llvm.mlir` - Updated `f64_mma_cvt` test to use `instrShape = [8, 8]` matching the new fp64 encoding. ---- # New contributor declaration - [X] I am not making a trivial change, such as fixing a typo in a comment. - [X] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [X] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [X] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [X] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

Partially reverts: - #9730 it still keeps/adjusts the improved verifiers to error out if the kernel provides invalid strides. This brings it in line with TMA which also only supports the logical HW view and transposes the view in the kernel instead.

This renames both the ampere style async_copy and tma functions. The original names are pretty much copied directly from ptx and while they are very descriptive, are also rather verbose. IMO these shorter names have a better glance-value and better conveys the intent of the programmer. I've also kept the original names as aliases to preserve backward-compatibility.

#10085) Reverts #10054 given now the mirror out of sync isue is gone.

This PR: - Enables CS and CV load cache modifier in Nvidia backend - Enables CS cache modifier in frontend - Improves test coverage for CDNA3 and CDNA4 --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

…10062) Extends the AMD ds_read_tr* local-load lowering to accept PartitionedSharedEncodingAttr as the source encoding. Previously the pattern bailed out as soon as it saw a partitioned shared encoding, forcing a slower generic local-load lowering for all WMMA dot-operand loads from partitioned LDS buffers. --------- Co-authored-by: Ognjen Plavsic <plognjen@amd.com>

Fpsan insufficiently synchronized global scratch accesses - we were missing a barrier in between memory accesses, which led to random failures in the tests and racy behavior.

…#10356) The current docstrings for `tl.multiple_of`, `tl.max_contiguous`, and `tl.max_constancy` do not match the underlying semantics modeled by `AxisInfo`. `multiple_of` describes its argument as "multiples of value" without explaining how the hint composes with contiguity, and the other two use the phrase "the value first values in input are contiguous/constant", which conflates the hint with its per-dimension length argument and confused at least one reader (#1324). This PR rewrites each docstring to mirror the precise definitions in `include/triton/Analysis/AxisInfo.h`. `multiple_of` now states that `values[d]` is the largest power of two that divides the first element of every contiguous group along dimension `d`. `max_contiguous` and `max_constancy` now say that the input forms contiguous (resp. constant) groups of length `values[d]` along dimension `d`, that `values` must have one entry per dimension, and that each entry must be a power of two. Each docstring includes a concrete 1D example taken from `AxisInfo.h` so the meaning is obvious from the rendered Sphinx page. `multiple_of` and `max_contiguous` now cross-reference each other via `:func:` so users can see how they compose to enable vectorized memory accesses. Behavior is unchanged; this is documentation only. Closes #1324.

they can still separately be overriden in the opt flags constraints

`K`-ragged metadata is not row-segment metadata for `BlackwellActMXScaleLayout`; passing it corrupts matmul output with swizzled activation scales. Test: GB200 `test_k_ragged_mxfp8_act_scale_swizzling` fails on `main` (77.3% mismatched elements) and passes here.

This solves issue #10341 When specifying the `acc` argument of `tl.dot` with dtype `tl.float16` without `out_dtype=tl.float16` raises an `AssertionError` because the default for `out_dtype` is hard-coded to be `tl.float32`. This PR changes the default `out_dtype` to be `None` to allow checking whether a value has been provided. This again allows to use the dtype of `acc` as the default for `out_dtype` (if the `acc` argument is specified). # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [x] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

…10386) More accurately account the smem usage of the persistent tf32 matmul kernel. This makes it so that we correctly enable 4-stage for tf32 persistent matmul when B is provided in the correct layout. For M=N=K=4096 matmul this leads to an improvement of about ~13% on GB200, from 560 TFLOP/s to 630 TFLOP/s. Validated that both NNN and NNT matmuls still run. Note that NNN does _not_ run with 4 stages, so some level of accounting / capping is necessary.  # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [ ] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [x] This PR does not need a test because `it is a straightforward heuristics change`. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.)

…put contiguity (#10359) Fixes #10067. In `SelectOpAxisInfoVisitor`'s tensor-cond branch, the call to `getDivisibilityFromContiguity` only sees the lhs/rhs contiguities and can overestimate divisibility when `condConstancy` further reduces the output contiguity below either input's contiguity. ### Concrete example - lhs: `[8, 9, 10, 11, 12, 13, 14, 15]` (c=8, d=8) - rhs: `[16, 17, 18, 19, 20, 21, 22, 23]` (c=8, d=16) - Element-wise condition, so `condConstancy = 1` Output contiguity collapses to `gcd(8, 8, 1) = 1` (every position is a leader). But `getDivisibilityFromContiguity` sees `c_lhs == c_rhs == 8` and returns `gcd(8, 16) = 8` — without accounting for `condConstancy`. The output value at position 1 may be 17, which is not divisible by 8. ### Why this didn't blow up On the current pow2 lattice, `gcd == min` on powers of 2. Codegen's `vec_width = min(c, d/e, ...)` is bounded by contiguity, and contiguity is computed correctly. So the overestimated divisibility is never the binding constraint on `vec_width`. This is a latent soundness regression introduced by #7781 — the pre-#7781 code clamped `divisibility` against the just-computed output contiguity, and that was sound independent of the pow2 invariant. ### Fix A conditional GCD with the output contiguity at the SelectOp callsite. The clamp fires only when `condConstancy` (or another shrinking factor) reduces the output contiguity strictly below at least one input's contiguity — i.e. it preserves the existing semantics when `condConstancy` is non-binding. The helper `getDivisibilityFromContiguity` is left unchanged (its other callers — `MaxMinOpAxisInfoVisitor` and `AxisInfo::join` — don't have a `condConstancy`-equivalent, so the same gap doesn't exist there). ### Test Added `select_cond_constancy_clamps_divisibility` in `test/Analysis/test-alignment.mlir`. The test fails before the fix (`divisibility = [8]`) and passes after (`divisibility = [1]`). # New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [x] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [ ] I have not added any `lit` tests. - [x] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) --------- Co-authored-by: Keren Zhou <kerenzhou@openai.com>

#10390) `InThreadTranspose` rewrites `tt.load -> ttg.local_alloc -> ttg.local_load -> dot_op` so the K-contiguous WMMA/MFMA operand can be read from LDS as wide `ds_load_b128` instead of scalar `ds_load_u16` pairs when the load order doesn't match the consumer's K dimension. The pattern matcher in `matchInThreadTransposePattern` already accepts `AMDWmmaEncodingAttr` alongside `AMDMfmaEncodingAttr`, but the gate in `is_in_thread_transpose_enabled` only activates the pass on gfx942 (CDNA3) and gfx120x (RDNA4, enabled in #10185). Extend it to also cover RDNA3 (gfx110x/gfx1103) and RDNA3.5 (gfx115x). Added a `inThreadTranspose_wmma` sub-test to `test/TritonGPU/amd/in-thread-transpose.mlir` (gfx1151, wave32, WMMA encoding) that verifies the pass produces an `amdg.in_thread_transpose` op and that the downstream `ttg.local_load` returns the K-contiguous `dot_op` layout (`kWidth = 16`). On AITER's `flash_attn_2.varlen_fwd` at the Qwen3-Omni ViT prefill shape (B=1, S=3200, H=16, head_dim=72, fp16) on gfx1151, this lifts the inner-loop V `local_load` from 512 scalar `ds_load_u16(_d16_hi)` to 144 vectorized `ds_load_b128` and gives a 3.8% median speedup (3.042 -> 2.925 ms).

…ut (#10345)

Motivation for this PR is to enable kernel authors to feed/select different llvm fn attributes locally at a kernel level scope as optimizations/llvm fn attributes could differ between kernels. This PR has 3 main changes: 1. Extend the LLVM function binding so add_fn_attr can emit valueless attributes. LLVM distinguishes bare function attributes from string-valued attributes, and AMD kernel attrs may need either form. 2. Add the HIP llvm_fn_attrs option so kernel authors can attach comma-separated LLVM function attributes to their own kernel during compilation without changing global backend state. 3. Add AMD tests covering single attributes, multiple attributes including a bare attr, and a sched-strategy attr that changes generated AMDGCN. Signed-off-by: Stanley Winata <stanley.winata@amd.com>

…layout (#10392) It is to the fix problem arises when `layoutToGluon()` converting a `PaddedSharedEncodingAttr` (w/ multicta) into a PaddedSharedLayout object. ### Problem Consider the case where PaddedSharedEncodingAttr has shape=[X, 128] and CGA-layout=[0,1], * Before the fix: `PaddedSharedLayout.cga_layout` is built directly from `PaddedSharedEncodingAttr`'s "block" basis. It will be `[0, 64=128/2]` * After the fix: `PaddedSharedLayout.cga_layout` will be `[0, 1]` The rationale can also be extrapolated from the `PaddedSharedLayout::_to_ir()` where the linear-layout is obtained by `operator*(PaddedSharedLayout.offset_bases, PaddedSharedLayout.cga_layout)` via function `get_padded_shared_layout()` ### Test * command `python3 -m pytest -v python/test/gluon/test_frontend.py::test_infer_layout_for_padded_shared` * the `ttgl.static_assert()` in the kernel makes sure CGA layout is expected

Currently, running `python python/examples/gluon/01-attention-forward.py` fails with ``` python/examples/gluon/01-attention-forward.py:386:14: error: source and result must have the same logical storage size (4096 vs 2048) p_tmem = s_tmem._reinterpret(config.dtype, [config.SPLIT_M, 2 * config.BLOCK_N], config.p_tmem_layout) ``` This happens when the benchmark is run with fp8. `pytest python/examples/gluon/01-attention-forward.py` doesn't fail simply because it doesn't test on fp8. The regression is from #10243 but the problematic code hardcoding a fixed column count like `config.BLOCK_N // 2` has been there for a long time. Previously it only worked due to the loose convention on reinterpret.

@lezcano

## Summary Improve error diagnostics for user-facing `assert` statements in `semantic.py` and `core.py` by adding descriptive messages following the "expected X, got Y" pattern. As requested by @lezcano in #10341 — previously, users hitting these asserts (e.g., mismatched accumulator dtype in `tl.dot`) would get an empty `AssertionError` with no guidance on what went wrong. ## Changes - `semantic.py`: 17 bare asserts → descriptive messages - `core.py`: 5 bare asserts → descriptive messages - Split compound `assert shape == X and dtype == Y` into two separate asserts for clearer diagnostics

This reverts commit 36394d4. We saw some compile failure cases with disabled vector combine. But should be resolved and should re-land this commit once with llvm/llvm-project#193499 or commit `81d618b6bc1e71cda79fe7bf9cbab63933dd5975` gets picked up in the next LLVM bump.

This PR adds check that tensor descriptor and smem encodings are equal. --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

We expect two different types of failures in this 2CTA race: - The 1 cta one - Have CTA1 race ahead (as it's doing nothing on tcgen05 ops) then load/store smem, and invalidate the barrier, before the barrier has finished being used by tcgen05_commit. We also remove the TWO_CTA flag in the TMA barrier as it doesn't make sense here.

Update the separate Blackwell ptxas pin from CUDA 13.1.80 to CUDA 13.3.33 so Blackwell builds use the CUDA 13.3 assembler artifact.

Use `string_view` and `packFixStrLiteral`

Introduce a build number to the LLVM build to avoid accidentally changing the LLVM that is used when building Triton. This allows us to have multiple builds from the same LLVM hash, but ensures that a given Triton commit builds deterministically because it is tied to the build number. Use the sha sum as well to enforce this.

Make `triton_kernels.tensor.convert_layout` return its input when its storage already represents the requested layout. ## Motivation Callers should not need to guard an idempotent layout conversion. This avoids redundant unswizzle/swizzle copies for already-correct MXFP intermediates. ### Strided layouts `StridedLayout` identifies the packed dimension, not compact physical strides. For example: ```python pitched = torch.empty_strided((64, 128), (256, 1)) tensor = wrap_torch_tensor(pitched) assert tensor.storage.layout == StridedLayout(-1) converted = convert_layout(tensor, StridedLayout(-1)) ``` Before this PR, the matching-layout conversion still copied through canonical storage and incidentally densified the tensor: ```python assert converted is not tensor assert converted.storage.data.stride() == (128, 1) ``` After this PR, the existing valid storage is preserved: ```python assert converted is tensor assert converted.storage.data.stride() == (256, 1) ``` Both strides are valid `StridedLayout(-1)` storage. Kernels and TMA checks consume physical strides separately. Invalid strided storage with no contiguous dimension is still rejected. ### Swizzled layouts Matching swizzled layouts are also already in the requested storage encoding. For example: ```python swizzled = convert_layout(tensor, HopperMXScaleLayout(-2, 8)) assert convert_layout(swizzled, HopperMXScaleLayout(-2, 8)) is swizzled ``` This avoids an unnecessary unswizzle/swizzle round trip and preserves any layout-specific padded storage. A different parameterized layout still converts normally: ```python assert convert_layout(swizzled, HopperMXScaleLayout(-2, 4)) is not swizzled ``` `convert_layout` changes storage encoding. It does not clone, densify, or scrub padding when the existing storage is already valid for the requested layout.

Now that the AMDGPU backend supports Memory Model Relaxation hannotations (MMRAs) on fences to indicate that only LDS / only global memory should be waited on, there is no need to manually insert waitcnt instructions, as the backend will correctly insert only the requested waits. This makes the atomic behavior we're looking for transparent to LLVM IR and avoids low-level intrinsics that require a bunch of complex bit-packing logic the backend already knows how to do. There's been a soft request from the backend folks for people to migrate to this form of fence if possible (though we at least have the benefit that we're not using inline assembly). AI disclosure: I wrote up what needed doing and had Codex do the work here (though I have checked the logic and it looks correct).

Add bulk packing for fixed fields

Fast tree traversal and remove unused methods

# New contributor declaration - [x] I am not making a trivial change, such as fixing a typo in a comment. - [x] I have written a PR description following these [rules](https://cbea.ms/git-commit/#why-not-how). - [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`. - Select one of the following. - [x] I have added tests. - `/test` for `lit` tests - `/unittest` for C++ tests - `/python/test` for end-to-end tests - [ ] This PR does not need a test because `FILL THIS IN`. - Select one of the following. - [x] I have not added any `lit` tests. - [ ] The `lit` tests I have added follow these [best practices](https://mlir.llvm.org/getting_started/TestingGuide/#filecheck-best-practices), including the "tests should be minimal" section. (Usually running Python code and using the instructions it generates is not minimal.) ## Summary Make local and remote cache group reads reject incomplete groups instead of returning a partial mapping. A partial group can make compilation treat a corrupted cache entry as a cache hit, only to fail later when loading the missing binary or IR artifact. ## What changed - Return `None` from `FileCacheManager.get_group()` if any recorded child path is missing. - Return `None` from `RemoteCacheManager.get_group()` if the remote backend does not return all recorded children. - Add regression coverage for both local and remote cache managers. ## Testing - `pre-commit run --from-ref origin/main --to-ref HEAD` - Added Python unit coverage in `python/test/unit/runtime/test_cache.py` Co-authored-by: Kevin-Li-2025 <2242139@qq.com>

Skip empty leaf nodes. We don't skip empty internal nodes to make the code simpler and more efficient. In fact, most empty nodes are leaves (kernels)

Use more fixed strings

This updates Triton's Blackwell-specific packaged CUPTI toolchain entry to the published CUDA 13.3.35 artifact. This fix a memory leak in cupti

…10415) Currently desc.store(...) does not guaruntee that that write is completed to global memory, so this makes message passing impossible. e.g. ``` desc.store(...) tl.atomic_xchg(flag, 1, sem="release") ``` does not release the store. To maintain perf in gluon, I also expose the read-only tma wait variant so users can explicitly opt-in to the behavior.

pull Bot added ⤵️ pull merge-conflict Resolve conflicts manually labels Apr 30, 2025

pawelszczerbuk and others added 28 commits April 16, 2026 07:29

[FPSan] explicit lowerings for fmin/fmax (#10039)

97b099b

Co-authored-by: Adam P. Goucher <goucher@statslab.cam.ac.uk>

[AMD][CI] Switch to snapshot repos to workaround mirror issue (#10054)

d326357

[BACKEND] Implement multiCTA support for TMA gather/scatter (#9977)

eb5efe2

We also add tighter invariants for gather/scatter ops as well

[triton] Python typing improvements (#10048)

f62f95b

[BACKEND] Model async TMA variants in ConSan (#10015)

e9352e2

We follow TMA load / store closely.

Limit warp specialization partition attrs to WS pass (#10058)

3be1a23

Use TargetInfo for shared memory accesses (#10038)

fe0c38b

[GSan] Model TMAReduceOp as atomic (#10053)

07ef288

[GSan] Model TMAReduceOp as atomic Previously we were modelling it as a non-atomic store. This now models it as atomic, but is still not 100% right as we're doing the shadow cell update eagerly before the TMA operation has completed.

[gluon][translator] Upstream some fixes (#10069)

043ba3e

* Convert `mask` to a tensor so it broadcasts correctly in gluon. NFC for triton * fix the `tl_obj_*` helpers * add an extra hook for rewriting some argument annotations

[AMD][CI] Disable ASAN tests due to non-deterministic failures (#10071)

cdac714

The test has been failing increasingly freqently. Disable while investigating.

Revert "[AMD][CI] Switch to snapshot repos to workaround mirror issue" (

6533931

#10085) Reverts #10054 given now the mirror out of sync isue is gone.

Improve cache modifiers support for loads (#9936)

88e8e52

This PR: - Enables CS and CV load cache modifier in Nvidia backend - Enables CS cache modifier in frontend - Improves test coverage for CDNA3 and CDNA4 --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

[FPSAN] Insert barriers in between scratch loads and stores (#10055)

147a60d

Fpsan insufficiently synchronized global scratch accesses - we were missing a barrier in between memory accesses, which led to random failures in the tests and racy behavior.

adityasingh2400 and others added 30 commits May 26, 2026 14:13

[KERNELS] make setting idle sms process-global (#10380)

efbae9a

they can still separately be overriden in the opt flags constraints

[KERNELS] fix hopper mxfp4 swizzle bug (#10385)

0c3303d

[TEST] Add 2CTA tcgen05 Gluon coverage for transposed LHS shared layo…

76d5b92

…ut (#10345)

[Test] Add more tests for cross CTA local_load/local_store (#10344)

0475130

[AMD][gfx1250] Verify descriptor and smem encodings (#10372)

8631d02

This PR adds check that tensor descriptor and smem encodings are equal. --------- Co-authored-by: Alexander Efimov <efimov.alexander@gmail.com>

Update Blackwell ptxas to 13.3 (#10398)

ac7445b

Update the separate Blackwell ptxas pin from CUDA 13.1.80 to CUDA 13.3.33 so Blackwell builds use the CUDA 13.3 assembler artifact.

[PROTON] Reduce msgpack serialization overhead-1 (#10393)

c2451b6

Use `string_view` and `packFixStrLiteral`

[PROTON] Reduce msgpack serialization overhead-2 (#10394)

a69e649

Add bulk packing for fixed fields

[PROTON] Reduce msgpack serialization overhead-3 (#10395)

0b21f97

Fast tree traversal and remove unused methods

[PROTON] Reduce msgpack serialization overhead-4 (#10396)

c5b3f3a

Skip empty leaf nodes. We don't skip empty internal nodes to make the code simpler and more efficient. In fact, most empty nodes are leaves (kernels)

[PROTON] Reduce msgpack serialization overhead-5 (#10397)

9e0f8bf

Use more fixed strings

[Gluon] Add support for nv local_store_async (#10357)

af1bca5

Update Blackwell cupti to 13.3 (#10416)

091c40a

This updates Triton's Blackwell-specific packaged CUPTI toolchain entry to the published CUDA 13.3.35 artifact. This fix a memory leak in cupti

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[pull] main from triton-lang:main#1

[pull] main from triton-lang:main#1
pull[bot] wants to merge 3453 commits into
maryamtahhan:mainfrom
triton-lang:main

pull Bot commented Apr 30, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

19 participants

Conversation

pull Bot commented Apr 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

19 participants

pull Bot commented Apr 30, 2025 •

edited

Loading